feat(python): expose create to DeltaTable class #1912

Closed
wants to merge 11 commits into from

Conversation

ion-elgreco
Collaborator

Description

Allows one to create a table without writing.

Related Issue(s)

@github-actions github-actions bot added the binding/python Issues for the Python package label Nov 26, 2023
@ion-elgreco ion-elgreco changed the title feat(python): expose create to DeltaTable class feat(python): expose create to DeltaTable class Nov 26, 2023
@MrPowers
Contributor

This is super exciting!! What's the full list of table features this syntax should support, or should be able to support in the future?

  • generated columns
  • column checks
  • anything else?

This is great!

@ion-elgreco
Collaborator Author

ion-elgreco commented Nov 26, 2023

@MrPowers I think those two at least.

For the constraints we could probably do something like this:

check_constraints: Dict[str, str]

Example: {"ageispositive": "age >= 0"}

For generated columns, I need to think about what's ideal there; it also depends a bit on the implementation on the Rust side.

Maybe:

{'col1': {'dtype': 'str', 'expr': 'concat(col2, col3)'}}
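
A minimal sketch of the two shapes suggested above, written as plain Python types. Nothing here is part of the deltalake API: `GeneratedColumn` and `validate_check_constraints` are hypothetical names used only to illustrate how the dictionaries could be typed and sanity-checked before being handed to a writer.

```python
from typing import Dict, TypedDict


class GeneratedColumn(TypedDict):
    # Hypothetical shape for a generated column, mirroring the dict above.
    dtype: str
    expr: str


# Constraint name -> SQL predicate, as proposed for `check_constraints`.
check_constraints: Dict[str, str] = {"ageispositive": "age >= 0"}

# Column name -> generated-column definition.
generated_columns: Dict[str, GeneratedColumn] = {
    "col1": {"dtype": "str", "expr": "concat(col2, col3)"},
}


def validate_check_constraints(constraints: Dict[str, str]) -> None:
    """Reject empty names or empty predicates (illustrative check only)."""
    for name, predicate in constraints.items():
        if not name.strip():
            raise ValueError("constraint name must not be empty")
        if not predicate.strip():
            raise ValueError(f"constraint {name!r} has an empty predicate")


validate_check_constraints(check_constraints)
```

Keeping the constraint predicate as a string leaves expression parsing to the engine, which is presumably why the shape depends on the Rust side.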

@MrPowers
Contributor

@ion-elgreco - cool, awesome, just wanted to make sure we're doing what we can to make this interface future-proof!

@ion-elgreco
Collaborator Author

@MrPowers perhaps we can expand the Delta Schema class to define generated columns, then they can be passed together with the rest of the schema.

@ion-elgreco ion-elgreco force-pushed the feat/expose_create_api branch from 7499a90 to a358aa3 Compare November 26, 2023 14:48
@ion-elgreco ion-elgreco force-pushed the feat/expose_create_api branch from a358aa3 to 18156fb Compare November 29, 2023 08:31
@r3stl355
Contributor

Looks good to me (after the docstrings are added :))

ion-elgreco and others added 10 commits December 1, 2023 13:10
- Adds the Rust writer as an additional engine in Python
- Adds overwrite schema functionality to the Rust writer. @roeap feel
free to point out improvements 😄

A couple of gaps will exist between the current Rust writer and the pyarrow writer.
We will have to solve these in a later PR:
- Replacewhere (partition filter / predicate) overwrite
(users can, however, work around this by calling DeltaTable.delete and then
appending)

- closes delta-io#1861

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: David Blajda <[email protected]>
Co-authored-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Matthew Powers <[email protected]>
Co-authored-by: Thomas Frederik Hoeck <[email protected]>
Co-authored-by: Adrian Ehrsam <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Marijn Valk <[email protected]>
# Description
The current implementation of `ObjectOutputStream` does not invoke flush
when writing out files to Azure storage, which seems to cause intermittent
issues where `write_deltalake` hangs with no progress and no error.

I'm adding a periodic flush to the write process, based on the written
buffer size, which can be parameterized via `storage_options` parameter
(I could not find another way without changing the interface). I don't
know if this is an acceptable approach (also, it requires string values)

Setting `"max_buffer_size": f"{100 * 1024}"` in the `storage_options`
passed to `write_deltalake` resolved the issue for me when writing a
dataset to Azure, which was otherwise failing constantly.
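
The size-based flush described above can be sketched in plain Python. This is an illustration of the idea only, not the code in this PR: `FlushingWriter` is a hypothetical wrapper, and an in-memory `io.BytesIO` stands in for the Azure output stream.

```python
import io


class FlushingWriter:
    """Wraps a binary stream and flushes once enough bytes have been written."""

    def __init__(self, raw: io.BufferedIOBase, max_buffer_size: int = 4 * 1024 * 1024):
        self.raw = raw
        self.max_buffer_size = max_buffer_size
        self.unflushed = 0    # bytes written since the last flush
        self.flush_count = 0  # kept only so the behaviour can be observed

    def write(self, data: bytes) -> int:
        written = self.raw.write(data)
        self.unflushed += written
        if self.unflushed >= self.max_buffer_size:
            self.flush()
        return written

    def flush(self) -> None:
        self.raw.flush()
        self.unflushed = 0
        self.flush_count += 1


# storage_options values are strings, so the size is parsed before use.
storage_options = {"max_buffer_size": f"{100 * 1024}"}
writer = FlushingWriter(io.BytesIO(), int(storage_options["max_buffer_size"]))
for _ in range(5):
    writer.write(b"x" * 50 * 1024)  # 50 KiB per write
```

With a 100 KiB threshold and 50 KiB writes, the wrapper flushes on every second write, so no more than 100 KiB is ever pending.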

The default max buffer size is set to 4 MB, which looks reasonable and is
used by other implementations I've seen (e.g.
https://github.com/fsspec/filesystem_spec/blob/3c247f56d4a4b22fc9ffec9ad4882a76ee47237d/fsspec/spec.py#L1577)

# Related Issue(s)
Can help with resolving delta-io#1770

# Documentation
If the approach is accepted, I need to find the best way of adding
this to the docs.

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description
Saves the user from ending up with a failed `load` call and a newly created folder by failing fast when the user tries to load a path that doesn't exist.
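
A rough sketch of the fail-fast idea for a local path, in Python. `ensure_table_path_exists` is a hypothetical helper written for this illustration and is not the function changed in this PR; the point is only that the existence check runs before anything could create the folder.

```python
from pathlib import Path
import tempfile


def ensure_table_path_exists(path: str) -> Path:
    """Hypothetical guard: raise before any loading code can create `path`."""
    table_path = Path(path)
    if not table_path.is_dir():
        raise FileNotFoundError(f"no table directory at {table_path}")
    return table_path


with tempfile.TemporaryDirectory() as root:
    missing = Path(root) / "does_not_exist"
    try:
        ensure_table_path_exists(str(missing))
        failed_fast = False
    except FileNotFoundError:
        failed_fast = True
    # The guard only inspects the path, so nothing is created as a side effect.
    created_as_side_effect = missing.exists()
```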

# Related Issue(s)
- closes delta-io#1916
# Description
A second attempt to extend `write_deltalake` to accept either a PyArrow
or a Deltalake schema (I messed up the previous PR with some rebase issues).
Added a test.

# Related Issue(s)
closes delta-io#1862

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description

Adds a documentation page on the Delta Lake Arrow integration.
According to the issue, the test should fail to load a table without a snapshot (version 0), but the test is written to check that it is possible to read and load a Delta table at version 0 in Rust (the functions `open_table` and `open_table_with_version` work).
…ings (delta-io#1895)

Delta protocol specifies 2 possible formats for timestamp partitions:
{year}-{month}-{day} {hour}:{minute}:{second} or {year}-{month}-{day}
{hour}:{minute}:{second}.{microsecond}

However, string comparison of partition filter value and partition
values was performed, which rendered timestamps like 2020-12-31
23:59:59.000000 and 2020-12-31 23:59:59 as different.

This change uses timestamp comparison instead of string comparison.
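
The difference between the two comparisons can be shown with a short Python sketch. The actual fix lives in the Rust crate; `parse_partition_timestamp` below is a hypothetical helper that only demonstrates why parsing first makes the two formats compare equal.

```python
from datetime import datetime

# The two partition timestamp formats named above.
_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S")


def parse_partition_timestamp(value: str) -> datetime:
    """Try each allowed format in turn and return the parsed timestamp."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized partition timestamp: {value!r}")


a = "2020-12-31 23:59:59.000000"
b = "2020-12-31 23:59:59"

# Comparing the raw strings treats the two spellings as different values.
strings_equal = a == b

# Comparing the parsed timestamps treats them as the same instant.
timestamps_equal = parse_partition_timestamp(a) == parse_partition_timestamp(b)
```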

Co-authored-by: Igor Borodin <[email protected]>
@ion-elgreco ion-elgreco requested a review from rtyler as a code owner December 1, 2023 16:35
@github-actions github-actions bot added binding/rust Issues for the Rust crate crate/core labels Dec 1, 2023
@ion-elgreco
Collaborator Author

Created a new PR here: #1932, since I keep messing up the rebase for some reason.

@ion-elgreco ion-elgreco closed this Dec 1, 2023
@ion-elgreco ion-elgreco deleted the feat/expose_create_api branch December 1, 2023 16:45
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate crate/core
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add API to create an empty Delta Lake table
5 participants